examples/models/gemma4_31b: CUDA Engine/Session adapter + OpenAI serving by mergennachin · Pull Request #20207 · pytorch/executorch

mergennachin · 2026-06-10T21:38:31Z

Adds the Gemma 4 31B serving path, mirroring qwen3_5_moe: a CUDA
Engine/Session adapter (chunked prefill, per-session mutable rebinding,
in-graph sampling) behind the model-agnostic LLMEngine/LLMSession
contract, a JSONL worker, and a serve.py launcher. The generic worker
loop gains an optional prompt_prefix_ids (Gemma BOS prepend) and
serving_chat a matching prompt_token_offset so the context count stays
honest. export.py emits get_mutable_buffer_metadata and prefill-chunk
bounds for multi-session.
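The interplay between the BOS prefix, the token offset, and chunked prefill can be illustrated with a minimal sketch. This is not the code from this PR: the helper names `apply_prompt_prefix`, `context_tokens_used`, and `prefill_chunks`, the BOS id, and the chunk size are all hypothetical, and only the parameter names `prompt_prefix_ids` and `prompt_token_offset` come from the description above. It assumes the offset is simply the number of prepended ids.

```python
from typing import Iterator, List, Optional, Sequence


def apply_prompt_prefix(
    prompt_ids: Sequence[int],
    prompt_prefix_ids: Optional[Sequence[int]] = None,
) -> List[int]:
    """Prepend optional prefix ids (for example a BOS token) to the prompt."""
    return list(prompt_prefix_ids or []) + list(prompt_ids)


def context_tokens_used(num_prompt_tokens: int, prompt_token_offset: int = 0) -> int:
    """Tokens the model actually consumes: tokenized prompt plus the prefix.

    Assumption: prompt_token_offset equals len(prompt_prefix_ids).
    """
    return num_prompt_tokens + prompt_token_offset


def prefill_chunks(token_ids: Sequence[int], max_chunk: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most max_chunk tokens for prefill."""
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    for start in range(0, len(token_ids), max_chunk):
        yield list(token_ids[start:start + max_chunk])


if __name__ == "__main__":
    BOS_ID = 2          # placeholder id, not read from any tokenizer
    prompt = [10, 11, 12, 13, 14]
    full = apply_prompt_prefix(prompt, [BOS_ID])
    print(context_tokens_used(len(prompt), prompt_token_offset=1))
    print(list(prefill_chunks(full, max_chunk=4)))
```

Keeping the offset separate from the tokenized prompt length lets the chat layer count context usage without knowing which model-specific ids the worker prepends.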

[ghstack-poisoned]

mergennachin · 2026-06-10T21:38:32Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2026-06-10T21:38:35Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20207

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

[ROCm] MI350 CI jobs will have longer queue times due to CI migration

❌ 6 New Failures, 10 Pending, 2 Unrelated Failures

As of commit 142ff89 with merge base f0dff03:

NEW FAILURES - The following jobs have failed:

pull / test-qnn-models-linux (dl3) / linux-job (gh)
RuntimeError: Command docker exec -t 9749910fa02d3511d5abc2e705f3bf70b492507eede0456c8cc6dd1f2d5831ec /exec failed with exit code 92
pull / test-static-llama-qnn-linux (stories_110m) / linux-job (gh)
RuntimeError: Command docker exec -t eb6b218fe46c44f30b0780d7d44cf46cf5d93c703e7581f06d5c1009b751e833 /exec failed with exit code 92
pull / unittest / linux / linux-job (gh)
RuntimeError: Command docker exec -t 900acb2098849bf67ee1c2a7c50613b3e9a716c0ad845e77188413408b9cdee9 /exec failed with exit code 1
pull / unittest / macos / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
pull / unittest-editable / linux / linux-job (gh)
RuntimeError: Command docker exec -t 663d0bfc1a6389f34636a45218af97012ac553900894db577552d54c8da18b2c /exec failed with exit code 1
pull / unittest-editable / macos / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

FLAKY - The following job failed but was likely due to flakiness present on trunk:

MLX / test-mlx-voxtral-realtime / test-mlx-voxtral-realtime (gh) (detected as infra flaky with no log or failing log classifier)

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / android / build-android (gh) (trunk failure)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2026-06-10T21:39:02Z

The committers listed above are authorized under a signed CLA.

✅ login: mergennachin / name: Mergen Nachin (1a1b822)

Adds the Gemma 4 31B serving path, mirroring qwen3_5_moe: a CUDA Engine/Session adapter (chunked prefill, per-session mutable rebinding, in-graph sampling) behind the model-agnostic LLMEngine/LLMSession contract, a JSONL worker, and a serve.py launcher. The generic worker loop gains an optional prompt_prefix_ids (Gemma BOS prepend) and serving_chat a matching prompt_token_offset so the context count stays honest. export.py emits get_mutable_buffer_metadata and prefill-chunk bounds for multi-session.

ghstack-source-id: c21c647
ghstack-comment-id: 4674805750
Pull-Request: #20207

[INITIAL] Update

1a1b822

mergennachin requested review from GregoryComer, JacobSzwejbka, SS-JIA, abhinaykukkadapu, digantdesai, kimishpatel, kirklandsign, larryliu0820, manuelcandales, psiddh, rascani, robert-kalmar and shoumikhin as code owners June 10, 2026 21:38

meta-cla Bot added the CLA Signed label Jun 10, 2026

mergennachin marked this pull request as draft June 12, 2026 19:45

[UPDATE] Update

142ff89

mergennachin wants to merge 2 commits into gh/mergennachin/12/head from gh/mergennachin/13/head
